What prevents us from reusing medical real-world data in research

Medical real-world data stored in clinical systems represents a valuable knowledge source for medical research, but its usage is still challenged by various technical and cultural aspects. Analyzing these challenges and suggesting measures for future improvement are crucial to improve the situation. This comment paper presents such an analysis from the perspective of research.


Reusing medical real-world data for medical data science
The main tasks in facilitating, or even enabling, the reuse of medical RWD in a research context are to promote interoperability, harmonization, and data quality, to ensure privacy, to optimize the retrieval and management of patient consent, and to establish rules for data use and access 12,13 . These measures aim to address the various challenges of scientifically reusing routine clinical data described below.
Challenges in balancing benefits and harms. Personal, i.e. non-anonymized, medical data is inherently sensitive 1,17,22 . As a result, uncertainties in MDS project preparation and execution arise for all roles involved in performing research on medical RWD, i.e. for patients, researchers and governing entities. The patients may lack trust in research using their personal data. Concerns about data misuse, becoming completely transparent and data leakage, especially in the case of long-term storage, can result in the patients overprotecting their own data and not giving their consent for its reuse in research [23][24][25] . On the other hand, it has also been shown that most EU citizens support secondary use of medical data if it serves the common good 24 . So, convincing patients of the social benefit of MDS can decrease their ambivalence and avoid overprotection. This can be achieved, for example, by reporting on MDS success stories 13 . A second important aspect is patient empowerment by informing patients about the processing and use of their data through open scientific communication and enabling their active engagement in the form of dynamic consent management 12,23 .
However, there are also concerns on the part of the researcher resulting e.g. from a lack of explicit training in a complex landscape of ethical and legal requirements. These could be mitigated by discussions in interdisciplinary team meetings but differences in the daily work routine make it difficult to arrange such meetings 8,9,18,21 . As a consequence of unresolved concerns, researchers could delay or even cancel their MDS projects. Moreover, even governing entities such as data protection officers and ethics committees exhibit a certain level of uncertainty regarding permissible practices in MDS. They tend to overprotect the rights of the patients whose medical data is to be used while underestimating the necessity of reusing medical RWD for research purposes 9,23,26,27 . This leads to restrictive policies hindering scientific progress.
In general, education is a promising approach to address the uncertainties mentioned above. Technical training for medical researchers and governing entities as well as ethical and legal training for technical experts can increase confidence in project-related decision making 1,18,23,24,27,28 . The same effect can be achieved by developing MDS guidelines and actionable data protection concepts (DPC) [13][14][15][16] . A good example is the DPC of the MI-I that was developed in collaboration with the German working group of medical ethics committees (AK-EK) 12 . Figure 1 summarizes the sources and consequences of the aforementioned uncertainties that lead to significant challenges in the reuse of medical RWD. Each source of uncertainty is associated with the roles it affects and possible measures to mitigate its impact. The challenges posed by these uncertainties are discussed in more detail below.
Uncertainties due to the legal framework. As mentioned above, the complex legal landscape resulting from various intervening laws contributes significantly to the uncertainty surrounding the reuse of medical RWD. At the European level, the General Data Protection Regulation (GDPR) holds substantial influence over the legal framework. In general, it prohibits the processing of health-related personal data (GDPR Art. 9 (1)) unless the informed consent of every affected person is given (GDPR Art. 9 (2a)) or a scientific exemption is present (GDPR Art. 9 (2j)). The latter is the case if the processing is in the public interest, secured by data protection measures, and adequately justified by a sufficient scientific goal. However, substantiating the presence of such a scientific exemption poses significant challenges 29,30 . Similarly difficult, or even more so, is obtaining informed consent of patients after they have left the clinics. As such, both GDPR-based possibilities to justify the secondary use of RWD in research are difficult to implement in practice 26,29 . If the processing is legally based on the scientific exemption, GDPR Art. 89 further mandates the implementation of appropriate privacy safeguards supported by technical and organizational measures. Additionally, it stipulates that only the data necessary for the project should be utilized (principle of data minimization) 30,31 . This ensures the protection of sensitive personal data, but also introduces further challenges for the researchers.
The situation becomes further complicated due to the GDPR allowing for various interpretations by the data protection laws of EU member states 30,31 . Moreover, there are country-specific regulations, such as job-specific laws, that impact the legal framework of MDS 31 . This complex scenario poses particular challenges for international MDS projects 29 . As a result, identifying the correct legal basis and implementing appropriate data protection measures becomes exceptionally difficult 29,30 . This task, crucial in the preparation of clinical data set compilation, necessitates not only technical and medical expertise but also a comprehensive understanding of legal aspects. Thus, a well-functioning interdisciplinary team or researchers with broad training are essential.
Analyses of the current legal framework for data-driven medical research suggest that this framework is remote from practice and thus inhibits scientific progress 31,32 . To address these limitations, certain legal amendments or substantial infrastructure enhancements are necessary. Particularly, the infrastructure should focus on incorporating components and tools that facilitate semi-automated data integration and data anonymization. Although the current legal framework permits physicians to access, integrate, and anonymize data from their own patients, they often lack the technical expertise and time to effectively carry out these tasks. By implementing an infrastructure that enables semi-automated data integration and anonymization, researchers would be able to legally utilize valuable medical RWD without imposing additional workload on physicians 29,30 . Attaining a fully automated solution is not feasible since effective data integration and anonymization, leading to meaningful data sets, necessitate manual parameter selection by a domain expert. Nonetheless, by prioritizing maximal automation and specifically assigning domain experts to handle the manual steps in the process, rapid and compliant access to medical RWD, along with reduced uncertainties for researchers, can be achieved.
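The division of labour between automated processing and expert-chosen parameters can be illustrated with a minimal sketch. It assumes k-anonymity as the anonymization criterion, which is only one of several possible techniques and is not prescribed by the analyses cited here; the field names, the toy records and the parameter values are hypothetical. Passing such a check does not by itself establish anonymity in the legal sense.

```python
from collections import Counter


def generalize_age(age, bin_width):
    """Map an exact age to the lower bound of its age band."""
    return (age // bin_width) * bin_width


def smallest_group_size(records, quasi_identifiers):
    """Size of the smallest group of records sharing all quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


# Manual step: a domain expert selects these parameters (hypothetical values).
QUASI_IDENTIFIERS = ["age_band", "sex"]
AGE_BIN_WIDTH = 10
K = 2

# Toy records, invented for illustration only.
raw = [
    {"age": 34, "sex": "f", "diagnosis": "I10"},
    {"age": 37, "sex": "f", "diagnosis": "E11"},
    {"age": 52, "sex": "m", "diagnosis": "I10"},
    {"age": 58, "sex": "m", "diagnosis": "J45"},
]

# Automated step: generalize the age and drop the exact value.
prepared = [
    {
        "age_band": generalize_age(r["age"], AGE_BIN_WIDTH),
        "sex": r["sex"],
        "diagnosis": r["diagnosis"],
    }
    for r in raw
]

# Automated step: check whether every quasi-identifier group has at least K members.
is_k_anonymous = smallest_group_size(prepared, QUASI_IDENTIFIERS) >= K
```

If the check fails, the expert would widen the age bands or revise the list of quasi-identifiers and rerun the automated steps.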
Ethical considerations and overprotectiveness. Not only the legal framework, but also ethical considerations can cause uncertainties. These can affect the patients and researchers but, in the context of an MDS project, especially the ethics committees as they have to judge whether a project is ethically justifiable. There are a variety of ethical principles to be taken into account for such a decision. These principles encompass patient privacy, data ownership, individual autonomy, confidentiality, necessity of data processing, non-maleficence and beneficence 1,33 . Considered jointly, they result in a trade-off to be made between the preservation of ethical rights of treated patients and the beneficence of the scientific project 15,18,26 . Criticism often arises concerning the prevailing trade-off in favor of patients' privacy, where ethics committees tend to overprotect patient data 23,27 . What is frequently overlooked is the ethical responsibility to share and reuse medical RWD to advance medical progress in diagnoses and treatment. Thus, a consequence of overprotecting data is suboptimal patient care which is, in turn, unethical 1,9,26 . Measures to prevent overprotection are increasing the awareness of its risks through education, as well as the development of clear ethical regulations and guidelines 28 . To facilitate the latter, the data set compilation process for medical RWD should be simplified, e.g. by standardization of processes and data formats because its current complexity challenges the creation of regulations and guidelines 17 .
Uncertainties in project planning. Many of the mentioned concerns related to legal and ethical requirements occur during project planning and design. Here a variety of decisions are made regarding the composition of the RWD set and its processing. These affect all subsequent project steps, but must be determined at an early stage if the project framework necessitates approvals from governing entities. This is because the governing entities require all planned processing steps to be documented in a study plan, serving as the foundation for their decision-making process. This results in long project planning phases due to uncertainties in a complex multi-player environment [13][14][15][16]21 . Additionally, creating a strict study plan usually works for clinical trials, but in data science, meaningful results often require more flexibility. For instance, it might be necessary to redesign the project plan throughout data processing. Therefore, project frameworks that show researchers how to reshape their project in specific cases would be much better suited for secondary use of medical RWD 25,34 .
Taking it a step further, a general guideline or regulation on how to conduct MDS projects would decrease planning time and the risk of errors, both of which are higher if each project is designed individually 14 . Until such a guideline exists, research teams should communicate intensively and plan their tasks collaboratively to minimize the uncertainties in project planning and, thereby, the duration of the planning phase 9,18 . Since this is a challenging task in a highly interdisciplinary environment, early definition of structures, binding deadlines, and clear assignment of responsibilities, such as designating a person responsible for timely data provision in each department, are crucial 8,14 .
The role of the patient consent. As mentioned in the introduction to section 3.1, dynamic consent management that allows patients to give and withdraw their consent at any point in time is a crucial measure to foster patient empowerment. As a result, it also leads to more acceptance of MDS by the affected individuals. Furthermore, in section 3.1.1 informed patient consent was mentioned as a possible legal justification for processing sensitive personal data. However, traditional informed consent requires patients to explicitly consent to the specific processing of their data. This means their consent is tied to a specific project 35,36 .
For retrospective projects such a consent cannot be obtained during the patients' stay at the hospital because the project idea does not exist at that time. Hence, the researcher would have to retrospectively contact all patients whose data is needed for the project, describe the project objective and methodology to them and then ask for their consent. This requires great effort, is, itself, questionable in terms of data protection and even not feasible if the patients are deceased. Making clinical data truly reusable in a research context, therefore, requires a broad consent in which the patients generally agree to the secondary use of their data in ethically approved research contexts. Furthermore, the retrieval of such a broad consent must be integrated into daily clinical routine and the consent management needs to be digitized. Otherwise, the information about the patient consent status might not be easily retrievable for the researcher 8,18,21,37 .
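What a digitized consent record could offer a researcher can be sketched as follows. This is a minimal illustration, not the consent management of the MI-I or of any existing system; the class `ConsentLedger`, its methods and the patient identifiers are hypothetical. The sketch assumes that consent events are stored with a timestamp and that the most recent event before the time of the query determines the status.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class ConsentLedger:
    """Append-only log of consent events per patient (hypothetical structure)."""

    # Maps patient_id to a list of (timestamp, granted) tuples.
    events: dict = field(default_factory=dict)

    def record(self, patient_id, timestamp, granted):
        """Store that a patient gave (True) or withdrew (False) consent."""
        self.events.setdefault(patient_id, []).append((timestamp, granted))

    def has_consent(self, patient_id, at):
        """Return the consent status valid at time `at`; no event means no consent."""
        past = [e for e in self.events.get(patient_id, []) if e[0] <= at]
        if not past:
            return False
        return max(past, key=lambda e: e[0])[1]


ledger = ConsentLedger()
ledger.record("p1", datetime(2023, 1, 10), True)   # broad consent given
ledger.record("p1", datetime(2023, 6, 1), False)   # consent withdrawn later

status_in_march = ledger.has_consent("p1", datetime(2023, 3, 1))
status_in_july = ledger.has_consent("p1", datetime(2023, 7, 1))
status_unknown_patient = ledger.has_consent("p2", datetime(2023, 7, 1))
```

Treating a missing record as "no consent" is a deliberately conservative design choice in this sketch.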
Previous research has documented that most patients are willing to share their data and even perceive sharing their medical data as a common duty 38 . Therefore, it is highly likely that extensively introducing a broad consent such as the one developed by the MI-I in Germany into clinical practice, combined with a fully digital and dynamic consent management, would have a significant positive impact on the feasibility of MDS projects 39 . It would allow patients to actively determine which future research projects may use their data.
Technical challenges. When describing the challenges resulting from balancing benefits and harms in MDS projects, some measures were suggested that require technical solutions. One example of this is the implementation of data protection measures like data access control, safe data transfer, encryption, or de-identification 20 . However, there are not only technical solutions but also challenges, as shown in Fig. 2.
One category of technical challenges results from the specificities of medical data outlined in section 2. Medical RWD is characterized by a higher level of heterogeneity regarding data types and feature availability than data from any other scientific field 18,19,26 . Thus, compiling usable medical data sets from RWD requires the technical capabilities of skillful data integration, type conversion and data imputation. However, heterogeneity is not restricted to data formats. A common problem is differences in the primary purpose of data acquisition or primary care leading to different data formats and standards being used 8 . This results in different physicians, clinical departments, or clinical sites not necessarily using the same data scales or units, syntax, data models, ontology, or terminology. Hence, it is difficult to decide which standards to use in an MDS project. A subsequent challenge arising from this lack of interoperability is the conversion between standards that potentially leads to information loss 19,26,40 . Last but not least, heterogeneity is also reflected in different identifiers being used in different sites. This challenges the linkage of related medical records, which may even become impossible once the data is de-identified 41 . Promising and important measures to meet the challenges concerning heterogeneity are the development, standardization, harmonization and, eventually, deployment of conceptual frameworks, data models, formats, terminologies, and interfaces 8,13,14,16,42 . An example illustrating the feasibility and effectiveness of these measures is the widely used DICOM standard for Picture Archiving and Communications systems (PACS) 18 . Similar effects are expected from the deployment of the HL7 FHIR standard for general healthcare related data that is currently being developed 43 .
However, besides appreciating the benefits of new approaches, the potential of already existing standards like the SNOMED CT terminology should not be neglected. It still has limitations, such as its complexity challenging the identification of respectively fitting codes and its incompleteness partly requiring to add own codes. On the other hand, SNOMED CT is already very comprehensive. Once its practical applicability is improved, SNOMED CT could be introduced as an obligatory standard in medical data systems fostering interoperability 13,16,42 .
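The effect of differing units across sites, one of the heterogeneity problems described above, can be illustrated with a small harmonization sketch. The sites, measurement values and the lookup table `TO_MG_PER_DL` are hypothetical; the conversion factor for glucose follows from its molar mass of about 180.16 g/mol. A real project would rather rely on a shared unit standard than on a hand-written table.

```python
# Hypothetical conversion factors to a project-wide target unit (mg/dL for glucose).
TO_MG_PER_DL = {
    "mg/dL": 1.0,
    "mmol/L": 18.016,  # glucose: 1 mmol/L is about 18.016 mg/dL
}


def harmonize_glucose(value, unit):
    """Convert a glucose measurement to mg/dL; unknown units raise instead of guessing."""
    try:
        factor = TO_MG_PER_DL[unit]
    except KeyError:
        raise ValueError(f"no conversion rule for unit {unit!r}")
    return round(value * factor, 1)


# Toy measurements from two sites documenting in different units.
measurements = [
    {"site": "A", "value": 99.0, "unit": "mg/dL"},
    {"site": "B", "value": 5.5, "unit": "mmol/L"},
]

harmonized = [harmonize_glucose(m["value"], m["unit"]) for m in measurements]
```

Raising an error for unknown units, rather than passing values through unchanged, keeps silently mixed scales out of the compiled data set.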
Another significant technical challenge is the fact that a majority of medical RWD is typically available in a semi-structured or unstructured format, while the application of most machine learning algorithms necessitates structured data 8,19,42,44 . Primary care documentation often relies on free text fields or letters because they can capture all real-world contingencies while structured and standardized data models cannot. In addition, documenting the cases in a structured way is too time-consuming for clinical practice. So, the primary clinical systems mainly contain semi-structured or unstructured RWD 7,13,23 . To increase the amount of available structured data, automated data structuring using Natural Language Processing (NLP) is a possible solution. However, it is not easy to implement for various reasons. Among them are the already mentioned inconsistent application of terms and abbreviations in medical texts and the requirement to manually structure some free text data sets to get annotated training data 13,42 .
Workflows in primary care settings not only lead to predominantly semi-structured or unstructured documentation of medical cases, but also greatly influence the design of clinical data management systems. In primary care and administrative contexts, such as accounting, clinical staff typically need a comprehensive overview of all data pertaining to an individual patient or case. As a result, clinical data management systems have been developed with a case- or patient-centric design that presents data in a transaction-oriented manner. However, this design is at odds with the need for query-driven extract-transform-load (ETL) processes when accessing data for MDS projects. These projects typically require only a subset of the available data features, but for a group of patients 8,26 . Developing a functional ETL pipeline is further complicated by the overall lack of accessible interfaces to the data management systems and the fragmented distribution of data across various clinical departments' systems 8,13 .
This means the design of primary clinical systems could be improved significantly if it allowed for more flexibility, i.e. if it supported patient- and case-centricity for primary care as well as data-centricity for secondary use. Moreover, the system design should comply with data specifications and developed standards rather than requiring the data to be created according to system specifications 13 . However, a complete redesign of primary clinical systems is most likely not feasible. An alternative solution is creating clinical data repositories in the form of data lakes or data warehouses that extract and transform medical RWD from primary systems and make it usable for research 45,46 . In this context, the use of standardized platforms and frameworks such as OMOP or i2b2 further increases the interoperability of the collected data 47 . In Germany, the MI-I established DIC and MeDIC whose goal is the creation of such data repositories for the medical RWD gathered at German university hospitals. As a common standard they agreed on the HL7 FHIR based MI-I core data set (CDS) 48 . Because this is work in progress and the data repositories are populated with data from primary clinical systems, the DIC and MeDIC still need to address the challenges identified in this comment paper to create FAIR data repositories for research.
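The mismatch between patient-centric storage and query-driven extraction can be made concrete with a minimal sketch. The nested record layout, the diagnosis codes used as cohort criterion and the function `extract_cohort` are hypothetical and do not mirror any particular clinical system; a real export would additionally replace the patient identifier with a pseudonym.

```python
# Patient-centric storage: one nested record per patient (toy data, invented).
patient_records = {
    "p1": {"age": 64, "diagnoses": ["I10", "E11"], "labs": {"hba1c": 7.2}},
    "p2": {"age": 41, "diagnoses": ["J45"], "labs": {"crp": 3.1}},
    "p3": {"age": 58, "diagnoses": ["E11"], "labs": {}},
}


def extract_cohort(records, has_diagnosis, features):
    """Return one flat row per matching patient containing only the requested features."""
    rows = []
    for patient_id, record in records.items():
        if has_diagnosis not in record["diagnoses"]:
            continue  # patient is not part of the requested cohort
        row = {"patient_id": patient_id}
        for name in features:
            row[name] = record["labs"].get(name)  # None marks a missing feature
        rows.append(row)
    return rows


# Query-driven view: few features, but for a whole group of patients.
cohort = extract_cohort(patient_records, has_diagnosis="E11", features=["hba1c"])
```

The resulting table contains only the requested feature for the selected group, and the missing value for the third patient shows why data imputation becomes part of compiling such data sets.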

Can we enable practical and FAIR research on medical real-world data?
The previous section has shown that compiling medical RWD sets for research carries several cultural and technical challenges. We can see that classical medical research and data science on RWD have not yet reached agreement. At university hospitals, there is still a clear focus on primary care and traditional clinical trials that is at odds with the demands of data science. Besides the technical and regulatory conflicts, there is a conflict between the principle of data minimization in medical research and the explorative big data approach of data science. Thus, it should be assessed by governing entities whether the beneficence of explorative big data outweighs the ethical benefits of data minimization.
Another important measure to enable FAIR MDS is to offer data systems, e.g. data repositories, meeting the needs of data scientists. These systems should enable comprehensive query-driven data exports and increase interoperability by using shared coding systems and terminologies. To simultaneously foster compliance with legal and ethical requirements, the systems should follow the paradigm of Privacy by Design, i.e. enforce data protection, e.g. by authorization, authentication and only allowing de-identified data to be exported. A resulting positive effect would be a decrease in uncertainties for the researchers since they would have to deal with fewer concerns about data protection and security. As long as the data infrastructure does not follow Privacy by Design, the uncertainties about the secondary use of routine clinical data remain for researchers, e.g. when determining the correct legal basis for the processing of medical RWD or designing the project aiming for ethical compliance. A possible measure to decrease these uncertainties is the simplification of project approval processes, e.g. by only requiring a single project application to be sent to an interdisciplinary deciding committee covering ethics, data security and data protection. Further simplification could be achieved by requesting flexible project frameworks rather than strict project plans from the researchers in the design phase. On the part of patients and governing entities, uncertainty regarding the justification of an MDS analysis often manifests itself in the form of overprotection. Section 3.1 described that an important measure to mitigate all such concerns is offering training for researchers, governing entities and patients. Moreover, enhanced patient engagement in the form of open science communication and dynamic consent management could further decrease the ambivalence of patients. In addition, a digital and dynamic consent management would increase the availability and reliability of the information on whether a patient currently consents to the secondary usage of their data.
Considering FAIRness as the gold standard for scientific usability of data, the current usability level of medical RWD for MDS can be improved significantly:
• Findability: The data system infrastructure at university hospitals is so fragmented that most data features are only findable with intense communication or experience, either from previous projects or clinical routine. Systematic investigation on available features in the individual data systems and the creation of data repositories as carried out by the DIC and MeDIC of the MI-I could help to increase findability.
• Accessibility: The access to medical data is currently complicated by uncertainties regarding privacy protection, complex ethico-legal requirements and the design of primary clinical systems lacking query orientation and accessible interfaces. Redesigning the systems or creating data repositories aiming for Privacy by Design and technical accessibility of clinical data would significantly ease the compilation of medical RWD sets for research.
• Interoperability: The interoperability is currently mainly restricted to the usage of the same patient identifiers within a hospital. Different departments often use different documentation policies, abbreviations, units, or own case IDs while different hospitals use different patient identifiers. Standardization as an agreement on common terminology, data models and coding systems would help to increase interoperability.
• Reusability: Given the current legal situation, true reusability is only achievable with anonymized data sets or a broad patient consent allowing the processing of patient data in ethically approved MDS projects. Otherwise, data sets are compiled and used on a project-specific basis. Once the legal basis for creating a reusable data set is established and implemented, metadata documenting data provenance should be created to further promote reusability.
To conclude, reusing medical RWD in MDS is not infeasible, but the current situation still poses a variety of challenges. This comment paper has outlined these challenges from the research perspective with a special focus on the situation in Germany and proposed high-level measures on how to effectively address them. Implementing these measures will itself be a big challenge, but it will significantly increase the usability of medical RWD for MDS and hence promote improvements in future healthcare. In this process, the technical changes will be easier to implement than the cultural ones.

Acknowledgements
We acknowledge support for the Article Processing Charge from the DFG (German Research Foundation, 491454339).

Funding
Open Access funding enabled and organized by Projekt DEAL.